Improving OCR Performance With Word Image Equivalence

Authors

Tao Hong

Jonathan J Hull

Abstract

OCR is an error-prone process when input images are degraded. Most current OCR techniques use linguistic information, such as character n-grams or dictionaries, to postprocess character recognition results. These methods essentially discard the input image after character recognition is complete. This paper proposes a new technique for improving the performance of an OCR system that uses information about equivalent word images inside a document. Words that are repeated inside a document are grouped into clusters by an image matching algorithm. The decisions of an OCR algorithm about the identities of those words are used to generate a common recognition result for each of the original word images. This technique thus combines information from the document image (word image clusters) with recognition results to correct errors made by OCR systems on different instances of the same word. Experimental results are presented that show about [value missing in source] of the words in a document are repeated two or more times. A clustering algorithm is able to reliably locate a large percentage of these words in the presence of noise. Experiments on images degraded with uniform noise show that the correct rate of a commercial OCR system can be improved from [value missing in source] to [value missing in source] on the words in those clusters. An error analysis is given that shows that, with further development, correct rates in the [value missing in source] range are achievable.

Fourth Symposium on Document Analysis and Information Retrieval, Las Vegas, NV, April 24-26, 1995, pp. 177-190.
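The pipeline the abstract describes (group repeated word images into clusters, then derive one common recognition result per cluster) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the pixel-disagreement measure, the greedy clustering, the `threshold` value, and the simple majority vote are all assumptions chosen for illustration, and the function names are hypothetical.

```python
from collections import Counter


def pixel_disagreement(a, b):
    """Fraction of differing pixels between two binary images (lists of rows).

    Assumption: images of different sizes are treated as non-equivalent.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return 1.0
    total = sum(len(row) for row in a)
    if total == 0:
        return 1.0
    diff = sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return diff / total


def cluster_word_images(images, threshold=0.2):
    """Greedy clustering: each image joins the first cluster whose
    representative (its first member) is within `threshold`; otherwise it
    starts a new cluster. `threshold` is an illustrative value only.
    """
    clusters = []  # each cluster is a list of image indices
    for i, img in enumerate(images):
        for members in clusters:
            if pixel_disagreement(images[members[0]], img) <= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def vote_within_clusters(ocr_words, clusters):
    """Give every member of a cluster the most frequent OCR result in it.

    Ties go to the result seen first; singleton clusters are left unchanged.
    """
    corrected = list(ocr_words)
    for members in clusters:
        if len(members) < 2:
            continue
        winner, _ = Counter(ocr_words[i] for i in members).most_common(1)[0]
        for i in members:
            corrected[i] = winner
    return corrected
```

With three instances of the same word, one of which the OCR misread, the vote restores the majority reading for all three. A real system would need a noise-tolerant matcher and a more careful way to combine per-character decisions than a whole-word vote.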

Similar resources

Document image summarization without OCR

A system for selecting excerpts directly from imaged text without performing optical character recognition is described. The images are segmented to find text regions, text lines and words, and sentence and paragraph boundaries are identified. A set of word equivalence classes is computed based on the rank blur hit-miss transform. This information is used to identify stop words and keywords. Se...

Visual inter-word relations and their use in OCR postprocessing

A technique is presented that uses visual relationships between word images in a document to improve the recognition of the text it contains. This technique takes advantage of the visual relationships between word images that are usually lost in most conventional optical character recognition (OCR) techniques. The visual relations are defined to be the equivalence that exists between images of ...

Correlating degradation models and image quality metrics

OCR often performs poorly on degraded documents. One approach to improving performance is to determine a good filter to improve the appearance of the document image before sending it to the OCR engine. Quality metrics have been measured in document images to determine what type of filtering would most likely improve the OCR response for that document image. In this paper those same quality metr...

Combining multiple thresholding binarization values to improve OCR output

For noisy, historical documents, a high optical character recognition (OCR) word error rate (WER) can render the OCR text unusable. Since image binarization is often the method used to identify foreground pixels, a significant body of research has sought to improve image-wide binarization directly. Instead of relying on any one imperfect binarization technique, our method incorporates informati...

Template-free word spotting in low-quality manuscripts

As OCR is not yet adequate for handwritten scripts with a large lexicon, word spotting has been introduced as an alternative to OCR. This paper proposes a novel approach to word spotting that, instead of matching features of the word image to features extracted from predefined templates, uses the estimated posterior probability output by a well-trained classifier for spotting. ...


Publication date: 2004
